Scenario: A company approaches you to predict data scientist salaries with machine learning.
December 4, 2017
Machine learning is a method for teaching computers to make and improve predictions or behaviours based on data.
Kaggle conducted an industry-wide survey of data scientists. https://www.kaggle.com/kaggle/kaggle-survey-2017
Information asked:
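library(mlr)      # tasks, learners, partial dependence; 'regr.randomForest' needs the randomForest package
library(ggplot2)  # ggplot(), scale_y_continuous()
library(tidyr)    # gather()
library(dplyr)    # %>% and arrange()
set.seed(42)
task = makeRegrTask(data = survey.dat, target = 'CompensationAmount')
lrn = makeLearner('regr.randomForest', importance = TRUE)
mod = train(lrn, task)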
"There is a problem with the model!"
# Individual conditional expectation (ICE) curves: one curve per respondent
ice = generatePartialDependenceData(mod, task, features = 'Age',
  individual = TRUE)
plotPartialDependence(ice) + scale_y_continuous(limits = c(0, NA))
# Centered ICE curves: every curve is anchored at Age = 20
ice.c = generatePartialDependenceData(mod, task, features = 'Age',
  individual = TRUE, center = list(Age = 20))
plotPartialDependence(ice.c)
# Partial dependence plot (PDP): the average effect of Age
pdp = generatePartialDependenceData(mod, task, features = c('Age'))
plotPartialDependence(pdp) + scale_y_continuous(limits = c(0, NA))
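The three plots above are closely related: a partial dependence curve is the average of the ICE curves, and centered ICE curves are ICE curves shifted to start at a common point. The following is a minimal sketch of that relationship in base R. It does not use the survey data or mlr; the built-in mtcars data set, the linear model, and the names fit, grid, and ice.centered are stand-ins chosen only so that the sketch runs on its own, with hp playing the role of Age.

```r
# Stand-in model (assumption): a linear model on mtcars instead of the
# random forest trained on the survey data.
fit <- lm(mpg ~ wt + hp + I(hp^2), data = mtcars)

# Grid of values for the feature of interest (hp plays the role of Age).
grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 10)

# ICE: for every observation, set hp to each grid value and predict.
# Result: one row per observation, one column per grid value.
ice <- sapply(grid, function(value) {
  modified <- mtcars
  modified$hp <- value
  predict(fit, newdata = modified)
})

# PDP: average the ICE curves at each grid value.
pdp <- colMeans(ice)

# Centered ICE: subtract each curve's prediction at the first grid value,
# so every curve starts at zero (analogous to center = list(Age = 20)).
ice.centered <- ice - ice[, 1]
```

Plotting each row of ice against grid would give the ICE curves, and plotting pdp against grid would give the partial dependence curve.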
"We want to understand the model better!"
feat.imp = getFeatureImportance(mod, type = 1)$res
dat = gather(feat.imp, key = 'Feature', value = 'Importance') %>%
  arrange(Importance)
dat$Feature = factor(dat$Feature, levels = dat$Feature)
ggplot(dat) + geom_point(aes(y = Feature, x = Importance))
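For a random forest, type = 1 requests the permutation-based importance (mean decrease in accuracy). The following is a simplified sketch of the permutation idea in base R, not the exact computation performed by randomForest, which works on out-of-bag data. The mtcars data set, the linear model, and the helper mse are assumptions made only to keep the sketch self-contained.

```r
set.seed(1)

# Stand-in model (assumption): a linear model on mtcars instead of the
# random forest trained on the survey data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: mean squared error of the model on a data frame.
mse <- function(data) mean((data$mpg - predict(fit, newdata = data))^2)

baseline <- mse(mtcars)

# Shuffle one feature at a time and record how much the error increases.
# qsec is not used by the model, so shuffling it cannot change the error.
importance <- sapply(c("wt", "hp", "qsec"), function(feature) {
  shuffled <- mtcars
  shuffled[[feature]] <- sample(shuffled[[feature]])
  mse(shuffled) - baseline
})
```

A feature whose shuffling barely changes the error contributes little to the predictions, while a large increase in error marks a feature the model relies on.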
pdp = generatePartialDependenceData(mod, task, features = c('Gender'))
ggplot(pdp$data) + geom_point(aes(x = Gender, y = CompensationAmount)) +
  geom_segment(aes(x = Gender, xend = Gender, yend = CompensationAmount), y = 0) +
  scale_y_continuous(limits = c(0, NA)) +
  theme(axis.text.x = element_text(angle = 10, hjust = 1))